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CROSS-VALIDATION IN NONPARAMETRIC REGRESSION 

WITH OUTLIERS 

By Denis Heng-Yan Leung 

Singapore Management University 

A popular data-driven method for choosing the bandwidth in 
standard kernel regression is cross-validation. Even when there are 
outliers in the data, robust kernel regression can be used to esti- 
mate the unknown regression curve [Robust and Nonlinear Time Se- 
ries Analysis. Lecture Notes in Statist. (1984) 26 163-184]. However, 
under these circumstances standard cross-validation is no longer a 
satisfactory bandwidth selector because it is unduly influenced by 
extreme prediction errors caused by the existence of these outliers. 
A more robust method proposed here is a cross-validation method 
that discounts the extreme prediction errors. In large samples the 
robust method chooses consistent bandwidths, and the consistency 
of the method is practically independent of the form in which ex- 
treme prediction errors are discounted. Additionally, evaluation of 
the method's finite sample behavior in a simulation demonstrates 
that the proposed method performs favorably. This method can also 
be applied to other problems, for example, model selection, that re- 
quire cross-validation. 

1. Introduction. Since Nadaraya [20] and Watson [27] first proposed us- 
ing the kernel method for curve estimation, there have been numerous in- 
vestigations about its theory and application [9]. When there is evidence 
that the data may be contaminated with outliers, robust kernel regression 
is effective in modeling the underlying curve [4, 7]. 

The standard error criterion for evaluating a statistical estimator is to de- 
termine how close it is to the true parameter. In nonparametric regression, 
a few popular criteria are the integrated squared error, the mean integrated 
squared error, the average squared error and the mean average squared error. 
Each of these criteria gives a measure of the distance between the regres- 
sion and the unknown curve "averaged" over the range of the independent 
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variable. In practice, any one of the above criteria can be used as they are 
all asymptotically very similar [9]. On the other hand, the criteria are all 
critically dependent on the bandwidth. The bandwidth selection problem is 
to find the (optimal) bandwidth that is optimal in the sense that the chosen 
error criterion is minimized with respect to all bandwidth choices. 

This paper focuses on bandwidth selection using the cross-validation 
method [11, 14, 21, 23]. The cross-validation method provides an estimate 
of the prediction error of the regression, and that, in turn, is an approxima- 
tion of the error criterion. The framework of the method is straightforward. 
For each observation in the data, the method evaluates the prediction error 
using kernel regression with that observation removed from the modeling 
process. The optimal bandwidth is chosen to minimize the sum of squares 
of the prediction errors from all of the observations. Cross-validation is one 
of a larger class of bandwidth selection methods called data-driven meth- 
ods, in which the Akaike Information Criterion [1] and Shibata's criterion 
[25] are examples. All these methods give asymptotically similar bandwidths 
[11] and, therefore, this investigation will only focus on the cross-validation 
method. 

In standard kernel regression, where there are no outliers in the data, 
cross-validation has been shown to produce bandwidths that are asymptoti- 
cally consistent [11, 12]. However, when there are outliers in the data, it has 
been demonstrated in simulations that the use of cross-validation can lead 
to extremely biased bandwidth estimates [17]. Arguably, the reason stan- 
dard cross-validation fails, even when applied using a robust regressor [7], is 
because it no longer produces a reasonable estimate of the prediction error. 
Therefore, a robust cross-validation method may be superior. The robust 
cross-validation method proposed here only differs from the average squared 
error by a constant shift and a constant multiple; both of which are asymp- 
totically independent of the bandwidth. Hence, the robust cross-validation 
rule asymptotically selects a bandwidth that optimizes the average squared 
error (or any other asymptotically equivalent error criterion). Similar meth- 
ods have been suggested by Leung, Marriott and Wu [17] in kernel smooth- 
ing and by Ronchetti, Field and Blanchard [24] in linear model selection. 
Further, Wang and Scott [26] suggested an L\ cross-validation in nonpara- 
metric regression. One weakness of these previous works is that they did not 
provide any analytical analysis of the methods. Boente, Fraiman and Me- 
loche [2] considered a robust plug-in estimator that is based on minimizing 
the mean integrated squared error of the robust smoother. But that method 
is not fully automatic in the sense that a "pilot" bandwidth is required. 

The rest of this paper is organized as follows. The method and its large 
sample results are given in Section 2. A simulated example and some finite 
sample simulation results are reported in Section 3. The method is applied 
to a real dataset in Section 4. A discussion of the results is given in Section 5. 
Proofs are given in the Appendix. 
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2. Main results. Let {x\, yi), . . . , (x n , y n ) be a set of data and consider 
the regression of Y on X at the n design points x±, . . . ,x n , 

(1) yi=m(xi) +£i, i = l,...,n, 

where m(-) is an unknown functional of X and {ej,i= l,...,n} are i.i.d. 
random noise with distribution F(-). Without loss of generality, it is assumed 
that the range of X is [0, 1]. 

A robust smoother m(-) of m(-) is defined by 

n 

n" 1 ^2 p{yi - rh(xi)} = min, 
i=i 

where p(-) is an even function with bounded first derivative ?/>(•). Conse- 
quently, the robust smoother can also be defined as the zero of 

n 

(2) E^^fi-O, 

i=l 

where aj(x) is a weight function based on a kernel K and the bandwidth 
h. The choice of p is problem-specific. A well-known choice can be found 
in [16]. Some common forms of the weight function ai(x) are 




t = 0, t n = 1, tj = l/2(xi + x i+ i) 



for i = 1, . . . , n — 1 ([5]), 

(Xi-xi-l^lirfc^) for i = 2,...,n ([22]). 

Similarly, the leave-one-out-smoother, m_j(-) of m(-), is defined as the zero 
of 

(3) J2 a i( x )^(yj - •)• 

While there are a number of studies that demonstrate the effectiveness 
of the robust smoother, m, in estimating m [4, 7], the result of directly 
applying the cross-validation rule, even using the robust smoother, is not 
as satisfactory. In fact, simulation results [17] showed that the conventional 
cross-validation rule, 

n 

CRVD(h) = n- 1 Y^iVi ~ rh.^Xi)} 2 , 
i=i 
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behaves unsatisfactorily when the data is contaminated with outliers, a sit- 
uation when robust smoothing is most needed. The form of the CRVD sug- 
gests the root of the problem. The CRVD is a sum of squares of the prediction 
errors of the smoother at each of the design points. When there are outliers, 
some of the prediction errors will be uncharacteristically extreme and these 
extreme prediction errors will affect the performance of the CRVD. There- 
fore, the extreme prediction errors should be discounted, as if they were 
extreme observations from a set of data. Based on this argument, a robust 
cross-validation rule can be defined as 

n 

(4) RCRVD(h) = n~ 1 Y / p{y i -rn-i(x l )}, 

i=l 

where /?(•) is a function whose role is similar to p. RCRVD(h) can be used 
as a surrogate for 

n 

ASE(h) = n~ l J2{m(xi) ~ m( Xi )} 2 
i=i 

and 

MASE(h) = E{ASE(h)\ Xl , . . . ,x n }. 

The choice of p and p will be discussed further. It will be demonstrated 
that (4) is a reasonable bandwidth selector, using the following assump- 
tions: 

(Al) F(-) is symmetric about zero. 

(A2) The kernel K is symmetric about zero. Furthermore, K is positive, 
Lipschitz continuous and satisfies 

(5) [ K(t)dt = l. 

(A3) The function m : [0, 1] — > (c, d) E 1Z is twice differentiable with 

/ {m (2) (x)} 2 dx < oo and m (p) (0) = m (p) (l), p = 0,l,2,..., 
Jo 

where denotes the pth derivative of m. 
(A4) The bandwidth sequence h depends on n and satisfies h — > 0, nh — > oo 
as n — > oo. 

(A5) The function p is a continuous function symmetric about zero and is 
differentiable everywhere except possibly at a finite number of points. 
(A6) There are constants cq,c\ > such that, for x G [0, 1], 

\Epip(y - m(x) + s)\ > c \s\, \s\ < ci, 

where ip is the derivative of p. 
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(A7) There exists a constant C2 such that, for x G [0, 1], 

\E F ip(y - m(x) + s)| < c 2 \s\. 
(A8) E F [t/;(-)] 2 <oo;E F W(-)}2<oo. 

Remark. For a symmetric noise distribution F, assumptions (A6) and 
(A7) are satisfied for ip{u) = u. For nonlinear ip functions, (A6) is satisfied 
if E F ijj'(y — m{x) + s) > cq and (A7) is satisfied if Epxjj'(y — m(x) + s) < C2 
for small s. Therefore, these conditions ensure that Epip' '(y — m(x) + s) is a 
positive bounded value. 

Hereafter, the notation is simplified by writing m, instead of m(xj); rrij 
instead of m(xj); m_j instead of m^i(xi); and E instead of Ep. Furthermore, 
the names c and q are generic names for constants; they may represent 
different values in different contexts. 

One difficulty in working with the robust estimate, rhi, is the representa- 
tion of it in a workable form. From the definition of rhi, it can expanded in 
a Taylor series about rrii as 

n n 

(6) 'Y^aj{x i )i)(yj - mi) + ^aj(xi){mi - rhiWiVj ~ m + »i) = 0, 
where < \rrii — fhi\. Therefore, if there exists a constant q > such that 



(7) sup 

h 



OLj(xi)i>' (yj - mi + Ui)-q 

3=1 



a.s., 



then, on the event V = {inf^^^i otj(xi)ijj' (yj — mi + v<j) > 0}, which occurs 
a.s., rhi can be represented as 



(8) mi 



Y,j=ittj(xi)ip{yj -mi) 



which is a linear combination of the random variables ip(yj — mi). This result 
can be shown by using (A6), (A7) and making use of Theorem 2 in [13] and 
Theorem 2 in [28]. 

Similarly, m_j can be represented as 

(9) mi ir . 



Proposition 1. If the conditions (Al)-(A8) are satisfied, then on the 
event V and over an interval [dn -1 / 5 , C t 2 n ~ 1 ^] for some suitable Ci>C2> 

(10) RCRVD(h) = c + ^^1mASE{K) + o p (n" 4 / 5 ) 

uniformly in h, where c is a constant w.r.t. h. 
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Proposition 1 shows that RCRVD differs from MASE by a constant shift 
and a constant multiple. Neither of these is dependent on the bandwidth. 
Hence, asymptotically, minimizing RCRVD and MASE w.r.t. h are equiva- 
lent. This result is shown in Proposition 3 below. 

Proposition 2. If the conditions (A1)-(A8) are satisfied, then 
ASE(h) = MASE(h) + Op(n" 4 / 5 ) 

uniformly in h. 

Proposition 3. If the conditions (A1)-(A8) are satisfied, 
/icrvd = arg mini? CRVD(h), 

h 

/iase = argmin ASE(h), 

h 

hyiASE = argmin MASE{h) 
h 

are all equivalent as n — > oo. In particular, they are all equal to kn~ 1 ^ , 
where 

= r jK\u)duE^(.) 1 V5 

1 ' [ti{m( 2 ){x)} 2 dxfu 2 K{u)du{Ei)i(-)¥_ 

Note that k depends on ip rather than ijj, and hence the bandwidth se- 
lected by RCRVD is asymptotically independent of the choice of tp (or p). 

3. Simulation study results. A simulation study on the finite sample 
performance of RCRVD was performed. In this study the following regression 
function, m(x), was used: 

(12) m(x) = sin(27rx), < x < 1. 

The observations Y{ were taken at Xi = i/n, for n = 257, and £j was an 
error from one of the following three distributions: (1) A(0,0.2); (2) Con- 
taminated normal 0.9A(0, 0.2) + 0.1iV(0, 1.8) and (3) Contaminated normal 
0.8^(0,0.2) + 0.2A(0, 3), where N(fi,a) stands for the normal distribution. 
The program for computing the robust smoother and the cross-validation 
rules was written in Fortran 90, incorporating the algorithm of Hardle [8]. 
Since Hardle's algorithm used a fast Fourier transform in the computations, 
the function was evaluated on an interval with the number of grids a power 
of 2, which was set to an upper limit of 1024. Hence, the choice of the sample 
size of 257 (= 2 8 + 1) in the simulation was for computational convenience. 
Of course, the method works for any other sample size. A standard normal 
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kernel was used for all smoothing. Robustness in smoothing was achieved 
by setting p to Huber's p [16], 




if \u\ < c, 
if \u\ > c, 



where the threshold c = 0.5 was chosen to obtain the appropriate degree of 
robustness. 

When using RCRVD, a second p, not necessarily the same as p, is used to 
discount the extreme prediction errors. In this study three different choices 
of p were considered: (1) Huber's p with c = 0.5; (2) Huber's p with c= 1; 
(3) p(u) = \u\. The choice p(u) = \u\ is equivalent to the L\ cross-validation 
method of Wang and Scott [26]. We note that, asymptotically, the different 
choices of p all give consistent bandwidths. But in finite samples, using 
different choices of p may lead to different bandwidths. 

First, the behaviors of the RCRVDs and CRVD were studied in a sim- 
ulated example. One sample of 257 observations was generated using each 
of the three error distributions described above. The data, along with the 
robust smoother (m) using the optimal bandwidth, /iase> were plotted in 
Figures 1(a), 2(a) and 3(a). Additionally, the values of the RCRVDs, CRVD 
and ASE were plotted as functions of the bandwidth, h, in Figures 1(b), 2(b) 
and 3(b). Since the optimal bandwdith chosen by a particular bandwidth 
selector is not affected by a constant shift in the value of the selector, in Fig- 
ures 1(b), 2(b) and 3(b) the plots of CRVD and RCRVD have been shifted 
by constant amounts for ease of comparison. When there were no outliers 
[Figure 1(b)], the minima of all bandwidth selectors (CRVD and all three 
RCRVDs) were close to that of ASE. When the data were contaminated 



(a) (b) 



x 10 




RCflVD.plM.I 

_ 5 I , . , , 1 -si 1 . 

0.2 0.4 0.6 0.8 1 -* -3-5 -3 -25 

X Value of log h 

Fig. 1. (a) AT(0,0.2) data with robust smoother using Hase- (b) Plot of ASE, CRVD and 
RCRVDs vs. logh. 
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[Figures 2(b) and 3(b)], the minima of CRVD were very different from those 
using ASE. The minima of the RCRVDs, on the other hand, were still very 
similar to those of ASE, even under heavy contamination [Figure 3(b)]. 

In the simulation study 100 samples of 257 observations were generated 
from each of the three noise distributions described above. Eight different 
bandwidth selection methods were considered in this study. These methods 
included the standard CRVD and the three different versions of RCRVD 
considered above. In addition, four selectors based on the robust plug-in 
method of Boente, Fraiman and Meloche [2] were included. The plug-in 
method they considered was based on finding /iplug-in = kn~ 1 ^ , where k is 
given by (11). Following Boente, Fraiman and Meloche [2], in the expression 
for /iplug-in; Eijj 2 {-) I 'E 2 tp' '(•) was replaced in (11) by 




Fig. 3. (a) 0.8AT (0,0.2) +0.2iV(0,3) data with robust smoother using h A SE- (b) Plot of 
ASE, CRVD and RCRVDs vs. log h. 
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where medj stands for median over i, i = 1, . . . ,n — 1. The quantity 
was evaluated by estimating m^(x) on a fine grid and 
then applying numerical integration. The quantity m^ 2 \x) was estimated 
using a robust kernel regression for derivatives ([2], page 119) using a "pi- 
lot bandwidth," ho, which had to be determined subjectively To examine 
the sensitivity of the plug-in method to the choice of ho , four choices of ho , 
0.02, 0.03, 0.04 and 0.06 were considered. 

For each sample in the simulation, we calculated the quantity 

/1oN I ^method ~ ^ASe| 



where method = CRVD, RCRVD or PLUG-IN. Ideally, a method should 
choose a bandwidth close to that chosen by ASE. Therefore, for a good 
method, the distribution of (13) should have most of its density clustered 
around zero. Based on the results from the simulations, the frequency distri- 
bution of (13) was tabulated for the different methods (Table 1). Each entry 
in Table 1 gives the number of times out of 100 simulations that the quan- 
tity (13) fell within the interval given in the column heading. For example, 
for CRVD, under iV(0,0.2) data, there were 33 times out of 100 simulations 
that (13) was smaller than 0.1. 

The results of the simulations showed that when the data were normal [Ta- 
ble 1(a)], all cross-validation based methods (CRVD and RCRVD) behaved 
similarly and satisfactorily, that is, the quantity (13) was less than 0.6 in 
most of the simulations, indicating that the methods chose bandwidths close 
to those given by ASE in most of the simulation runs. The performances of 
the plug-in methods, however, were quite different. Among the four plug- in 
methods, only the one using a pilot bandwidth, ho = 0.04, gave comparable 
performance to the cross-validation methods, while the other three gave very 
disappointing results. This pattern of performance by the plug-in method 
was consistent throughout the study, giving rise to implications later. 

For mildly contaminated normal errors [Table 1(b)], the cross-validation 
methods started to diverge in their performances. CRVD often chose a band- 
width that was much smaller than that selected by ASE, resulting in moder- 
ately large values of (13) in a lot of the samples. For RCRVD, there was no 
evidence of any impaired performance using either Huber's p with c = 0.5 or 
p(-) = | • |, even though the performance using Huber's p with c = 1 yielded 
less satisfactory results. The performance of CRVD became worse when the 
proportion and severity of contamination were high [Table 1(c)]. Among the 
RCRVDs, the performance was better in the selector with p(u) = \u\. 

To gain more insight into the behaviors of the different methods, the 
Monte Carlo mean and standard deviation of the optimal bandwidth chosen 
by the different methods are summarized in Table 2. From this table the 
following observations can be noted. First, despite the poor performance of 
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Table 1 

Frequency distribution of |/i mo thod — /iaseI/^ase for the various bandwidth selectors: 100 
data sets of size 257 with three different error distributions 



Method 




Value of \h 


method 


/iase|//i 


ASE 






<0.1 


0.1-0.2 0.2-0.4 


0.4-0.6 


0.6 0.8 


0.8-1 


1-2 


>2 


(a) Error distribution: AT (0,0.2) 
















CRVD 


33 


20 


33 


8 


6 











RCRVD, Huber's p, c = 0.5 


32 


21 


33 





Q 
O 


n 
u 


n 
u 


n 
U 


RCRVD, Huber's p, c=l 


33 


20 


33 


Q 

8 





U 


U 


n 

U 


RCRVD, p(u) = |u| 


27 


27 


31 


1 




U 


1 
1 


n 
U 


PLUG-IN, h =0.02 








14 


UO 


1 Q 


n 
l J 


U 


n 
U 


PLUG-IN, h =0.03 


18 


18 


49 


1 - 

10 


n 
U 


U 


A 

9 


n 
U 


PLUG-IN, h =0.04 


38 


31 


27 


1 


n 
U 


U 


9 


n 
U 


PLUG-IN, ho = 0.06 


17 


18 


26 


1 o 
19 


1 i 
1 1 


- 




i 


n 
U 


(b) Error distribution: 0.9JV(0, 0.2) + 0.1A/(0, 1. 


8) 












CRVD 


20 


13 


28 


19 


8 


5 


7 





RCRVD, Huber's p, c = 0.5 


43 


17 


25 


11 


4 











RCRVD, Huber's p, c = 1 


27 


20 


27 


10 


6 


4 








RCRVD, p(u) = |u| 


36 


25 


23 


11 


3 


2 








PLUG-IN, ho =0.02 








3 


40 


56 


1 








PLUG-IN, ho =0.03 


6 


7 


36 


51 














PLUG-IN, ft =0.04 


23 


25 


43 


9 














PLUG-IN, h = 0.06 


28 


20 


29 


11 


6 











(c) Error distribution: 0.8iV(0, 0.2) 


+ 0.2iV(0, 3) 














CRVD 


5 


15 


18 


25 


17 


10 


8 


2 


RCRVD, Huber's p, c = 0.5 


36 


26 


21 


5 


8 


1 


3 





RCRVD, Huber's p, c=l 


18 


20 


33 


16 


8 


2 


3 





RCRVD, p(it) = |u| 


38 


29 


17 


6 


6 





4 





PLUG-IN, ho =0.02 








1 


12 


71 


16 








PLUG-IN, h =0.03 





4 


14 


52 


29 


1 








PLUG-IN, ft =0.04 


8 


7 


31 


50 


3 


1 








PLUG-IN, ho =0.06 


23 


32 


32 


9 


2 





2 






CRVD, its means were very similar to the corresponding means under ASE. 
However, its standard deviation, especially under heavy contamination (last 
column of Table 2), was much higher than those of ASE and the other se- 
lectors. This result indicates that CRVD sometimes chose bandwidths that 
were quite far away from those using ASE. Second, the standard deviations 
of all plug-in methods were small, which is consistent with results seen in 
other applications of the plug- in method (cf. [6]). However, for these meth- 
ods the means were very sensitive to the choice of the pilot bandwidth, 
/io- These observations indicate that the poor performance of the plug- in 
methods results from the bias that can arise from an inappropriate choice 
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Table 2 

Mean and standard deviation of the optimal bandwidths for the various bandwidth 

selectors: 

100 data sets of size 257, with three different error distributions 



Method 



ASE 
CRVD 

RCRVD, Huber's p, c = 
RCRVD, Huber's p, c = 
RCRVD, p(u) = M 
PLUG-IN, ho = 0.02 
PLUG-IN, h =0.03 
PLUG-IN, h =0.04 
PLUG-IN, h = 0.06 



AT(0,0.2) 



0.0471 (0.0000507) 
0.0488 (0.0000457) 
0.5 0.0481 (0.0000558) 
1 0.0488 (0.0000460) 
0.0477 (0.0000847) 
0.0228 (0.0000058) 
0.0349 (0.0000101) 
0.0461 (0.0000175) 
0.0620 (0.0000478) 



Error distribution 



0.9N (0,0.2) 
+ 0.1iV (0,1.8) 



0.0575 (0.0001232) 
0.0593 (0.0003014) 
0.0581 (0.0000799) 
0.0585 (0.0000976) 
0.0588 (0.0001253) 
0.0222 (0.0000084) 
0.0346 (0.0000090) 
0.0466 (0.0000185) 
0.0659 (0.0000448) 



0.8AT(0,0.2) 
+ 0.2iV(0,3) 



0.0735 (0.0002968) 
0.0675 (0.0026117) 
0.0744 (0.0001589) 
0.0703 (0.0002314) 
0.0758 (0.0001811) 
0.0199 (0.0000199) 
0.0327 (0.0000245) 
0.0437 (0.0000559) 
0.0665 (0.0000771) 



Mean (SD) Mean (SD) Mean (SD) 



of Jiq. Therefore, for the plug-in methods a good choice of pilot bandwidth 
is cruical. Third, in contrast to the other methods, all the RCRVDs seemed 
to choose unbiased bandwidths, and the standard deviations, even though 
higher than those of the plug-in methods, were not unreasonably large. 

4. Application to water-quality data. The tremendous development in 
Florida from the 1960s to the 1990s and the changes in the demand in and 
practice of water use over that period had caused a decrease in water supply 
and quality in the 1990s. In response to these problems, in 1996 the U.S. 
Geological Survey conducted a study on the long term water quality trend 
in Southern Florida. In this section the methods considered in this paper are 
applied to data from the U.S. Geological Survey. The data were collected at 
two discharge stations — one within Big Cypress National Preserve and one 
near Biscayne Bay [18]. 

A large number of water-quality constituents and flow data were collected 
periodically between 1966-1994 and used in the survey. The methods in this 
paper are illustrated using the dissolved solids data collected at Tamiami 
Canal station inside the Big Cypress Preserve. The data included the data 
that were used in the survey and also data collected up to 1999. The total 
number of observations was 118. The raw data of dissolved solids level (in 
mg/1), as a function of time, were plotted in Figure 4(a). The plot clearly 
shows outliers in the data. A robust smoother was fit by setting p to Huber's 
p. The value of c in p was obtained by first fitting a LOWESS curve [3] to 
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5100 
5050 
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4950 
4900 
4850 
4800 
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I 
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(b) 



CHVD 

RCRVD, Huberp, c=64 
RCRVD, pOI.I 



3 3.5 
Value of log ft 



(c) 



t r: r 



n 1 1 r 



■ ■ . ■ CRVD 

._ RCRVD, Huburp, c=64/ p(.)=l.l 

PLUG-IN, h =50 

_ _ PLUG-IN, h =60 




1968 1970 1972 1974 1976 1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 

rear 



Fig. 4. (a) Scatterplot of dissolved solids as a function of time at the Tamiami Canal 
station, (b) Plot of CRVD and RCRVDs vs. log/i. (c) Smoothers for dissolved solids as a 
function of time at the Tamiami Canal station. 
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the data. A smoothing parameter of 0.5 was used for the LOWESS fit, which 
was the value used by Lietz [18]. The value of c = 1.345 x (med, |res,|)/0.645, 
where resj was the residual of the ith observation from the LOWESS plot, 
was used for p. Using the data, the value of c = 64 was obtained. 

The following methods for bandwidth selection were considered: (1) CRVD; 
(2) RCRVD with Huber's p,c = 64; (3) RCRVD with p(u) = \u\; (4) the plug- 
in method with pilot bandwidth ho = 50; and (5) the plug-in method with 
ho = 60. 

The robust smoother was evaluated on a grid with intervals of 50 days. 
Therefore, all results were quoted in units of 50 days. Figure 4(b) gives the 
CRVD and the two RCRVDs as functions of the bandwidth. It is clear that 
the minimum of CRVD appeared much earlier than the other two curves. 
The optimal bandwidths, obtained by grid searches, were 9 (= 9 x 50 days), 
29 and 29, respectively, for CRVD, RCRVD with Huber's p,c = 64, and 
RCRVD with p(u) = \u\. The smaller minimum of CRVD made sense; the 
few extreme values in the dataset caused CRVD to suggest a rougher curve to 
downweight the extreme prediction errors. The optimal plug-in bandwidths 
were 18 and 20.4, respectively, for ho = 50 and 60. 

The robust smoothers based on the different bandwidths are plotted in 
Figure 4(c). The rough curve (dotted line) is based on the bandwidth us- 
ing CRVD. The other curves are similar to each other. The results show a 
downward trend in dissolved solids over time at Tamiami Canal station. 

5. Conclusion. This study demonstrates that, in data suspected of con- 
taining outliers, using standard cross-validation can lead to bandwidths that 
are not optimal with respect to the usual error criteria. On the other hand, 
the robust cross-validation method suggested in this article provides band- 
widths that are asymptotically optimal with respect to these criteria. Propo- 
sition 1 demonstrates that, asymptotically, RCRVDs using different forms 
of p(-) differ only by a constant. This finding suggests that all robust cross- 
validation methods that satisfy the assumptions of the proposition give the 
same asymptotic bandwidth. Asymptotically, the bandwidth chosen is op- 
timal with respect to MASE and ASE. The assumptions on the form of 
p(-) are all very mild. The most important assumption is the symmetry of 
p(-), which ensures deviations from either side of the unknown curve to be 
weighted equally. In small and moderate samples, however, different forms 
of /?(•) will give different results. From the derivation of Proposition 1, it is 
important to note that the extent of how well RCRVD approximates MASE 
or ASE depends on how well rhi and m_j are represented as (8) and (9), 
respectively. It has been shown that a p with a small value of f{ip^ 2 \')} 2 
will give satisfactory results. 

The method suggested here includes the L\ robust bandwidth selector 
of Wang and Scott [26] as a special case. Another alternative to the pro- 
posed method is the plug-in method suggested by Boente, Fraiman and 
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Meloche [2]. The simulations in this study show that, the bandwidth chosen 
using a plug-in method, despite having a smaller standard deviation than the 
cross-validation method, is highly sensitive to the choice of the pilot band- 
width. Similar observations were made by Boente, Fraiman and Meloche [2], 
Tables 1-4, though the sensitivity they observed was less severe. 

The method described here assumes a circular design in which there are 
no boundary problems. But the method can be easily adapted to work for 
a general regression function with the incorporation of a weight function 
that discounts the contribution of the boundary observations to the cross- 
validation; see, for example, [15]. The results for Propositions 1-3 will still 
hold under this situation. 

The method studied in this article is a general robust method. It can be 
applied to problems other than kernel smoothing where cross-validation is 
needed. For example, Ronchetti, Field and Blanchard [24] considered robust 
cross-validation for model selection in a linear regression. This method can 
be used in that application. 



APPENDIX: SKETCHES OF THE PROOFS 



Proof of Proposition I. Without loss of generality, it is assumed 
here that p = p. This assumption will simplify the notation in the proof 
considerably. The result of the proposition still holds when p^ p. The first 
step of the proof is to expand RCRVD in Taylor series. To show (10), it 
is required to demonstrate that the sup norms of the l.h.s. of (A.1)-(A.3) 
over the interval (dn -1 / 5 , C2^ -1 ^ 5 ) vanish. The general idea is to partition 
into many small subintervals and then "discretize" the 
sup norm problem into one of finding the maximum of a finite number of 
sup norms within these subintervals, which themselves can be bounded easily 
by established results [see, e.g., (A.5)-(A.8)], 

n 

RCRVD (h) = n~ l ^ pfa + m 8 - m_i); 
i=i 

expand p by Taylor series at the points £i,i= 1, . . . , n, 

n 

RCRVD (h) = rT 1 J2p( £ *) + D ^ h ) + D ^ h ) + D 3(fr), 

i=l 

where 



D 1 (h)=n 1 ^(m i - m_j)^(ej), 



i=l 

\2 
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and Ds{h) = n" 1 Ya=i Ri are the remainder terms in the Taylor series ex- 
pansions. To prove (10), it is required to demonstrate that 

(A.l) D 1 (h)=o p (n- 4 / 5 ), 
(A.2) D 2 (h) - ^^lMASE(h) = o p (n" 4 / 5 ), 

(A.3) D 3 (h)=o p (n-^ 5 ) 

uniformly in h. 

D\(h) can be written as 

n 

(A.4) D 1 (h)=q- 1 n- 1 J2J2 a ^ x ^)^)- 
Therefore, to show (A.l), for some e > 0, 
pfsup|-Di(/i)| >n~ 4 / 5 e 

(A.5) <P[ sup sup \Di(h e ) -Di(s e )| >n~ 4/5 e/ 2 

\ e=2,...,m s e _i<h e <s e 



+ p( sup |L>i(MI >™~ 4/ V 2 Y 

\e=2,...,m / 



where the intervals si < S2 < • • • < s m form a partition of the interval ((jn -1 / 5 , 
(^ti." 1 / 5 ) and the sizes of the intervals (s e -i, s e ), e = 2, . . . , m, are to be cho- 
sen small enough. In other words, if m is large enough, the first term on 
the r.h.s. of (A.5) becomes negligible compared to the second term. Writing 
T) = s/2, it is necessary to demonstrate 

(A.6) PI sup |£>i(M| >n~ A/5 ri) 

\e=2,...,m / 

goes to zero. By Bonferroni's inequality, (A.6) is bounded above by 



(A.7) m sup PQD^he)] >n- 4 / 5 r?) <m sup E(n 4 / 5 r ] ~ 1 \D 1 {h e )\) 2k , 

e=2,...,m e=2,...,m 

where k = 1, 2, 

Define = ip(ei),i = l,...,n. D\(h e ) is a quadratic form of the Zj's, 
PijZiZj, where flij = q~ 1 n~ 1 aj(xi), = and independent for 

i 7^ j. Now, using Theorem 2 of [28], 

fV/V'UMMI) 2 " 



<c 



n 8//5 (r]qnh e 
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(Ai 



i=i i 



2/,- 



X j X n 



l )IT. K 

l=\ 



<c 



n (-^S/ 5 )k h -2kJ2 n l h l e{E ^ {£i)} 2k 



1=2 



Note that, in the above the Nadaraya-Watson form of oti{x) is used. But 
the result still holds for Gasser-Muller, Priestley-Chao or other weights. 
Therefore, 



m sup ^(n^V^iOe)!) 

e=2,...,m 



2A- 



< m sup c 

e=2,...,m 



2k 



n(- 4+8 ^ k h- 2k Y,n l ti e {Ei; 2 (e t )} 2k 



1=2 



which goes to zero for fixed m, h e £L [s e _i, s e ] and n going to oo, if k is chosen 
to be sufficiently large. 

The derivation of (A. 2) is similar to that for (A.l). First write 

-■ n n 

D 2 (h) - ED 2 (h) = — ^(m - mi)V fa) - ^E^rm - fh^f ] Etf '(e,) 



i=i 

n 



i=l 



— ^{(rrii - rhi) 2 - E{rm - m_i) 2 }V>'(ei) 



i=\ 



E(rm - m.if^'iei) - E^'(e l )} 



8=1 



i=l i=l 

It will be demonstarted that n" 1 Ya=x W{ = o p (n~ 4 / 5 ) and n" 1 Ya=i W% = 
o p (n _4 / 5 ) uniformly in h. Since tp(-) is bounded, for n~ l Ya=1 ^ * s enou gh 
to show that [(mi — mi) 2 — _E(m,i — m_i) 2 ], i = 1, . . . , n, are o p (n~ 4//5 ). Each 
term is expanded as 



-i 2 



E 



1 2 



Y^Y^oLj{xi)ak{xi){il){yj -rrn) - E^yj - mi)} 



2 



+ 



(A.9) + 
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x {ip(yk - m) ~ Eip{y k - mi)}\ 

T,j^iT,k^i a j( x i) a k(xj)^{yj - mj)E^(y k - rrij) 
q 2 

T,j^iT,k=ii a j( x i) a k(xi)Hyk - mi)Eip(yj - rrij) 



Y J j^ 2 j {xi)[E^{y j - mi){Erl){y k - rrij)} 2 } 
q 2 

_ 2 T,j^iT l k=ii,k^j a j( x i) a k(xi)E'ilj(y j - mi)Eil)(y k - mi) 

x {ip(Vk ~ rrii) - Eip(y k - mi)}!/ q 2 
T l j^iT lk ^iaj(x i )a k (x i ){^(y j -rrii) - Eip(yj -rrii)}E^(y k -rrii 



+ 



q 2 



J2j^iJ2k^i a j( x i) a k( x i){^(yk - mi) -Eijj(y k -mi)}Eip(yj -rrij) 

2 

q 2 

_ J2j^jQ i j{ x i)ak{xi)[Eip 2 (y j -nn){E^(y k - rm)} 2 } 

q 2 

= Wt^h) + W{ 2 {h) + WUh) + W[ A (h). 

To bound W| 1 (/i), write rj = tp(yj — rrii) — Eip(yj — rrii). Note that rj, rj, i / j, 
are independent r.v.'s with zero mean. An application of Theorem 2 of [28] 
gives 

(A.10) P (sup \W^(h)\ > n- 4 ^e] < cmn^ 4+8 ^ k h- 2k Y,(nh e ) 1 . 

' 1=2 



V h 



Similarly, W{ 2 (h), W{ 3 (h), W[ A (h) can also be bounded as in (A.9). There- 
fore, Wf^h), W[ 2 {h), W{ 3 (h), W[ A (h) all vanish as long as a large enough 
k is chosen, as n — > oo and h G (^n -1 / 5 , C2 n_1//5 )> with m fixed. From [10], 
\E(rrii — rri-i) 2 — E(rrii — rfii) 2 \ = o p (n _4 / 5 ) and E(rrii — rrii) 2 = O p (ri _1 /i _1 + 
h 4 ) . Therefore, for e > 0, 



P[ sup 
V h 



i=i 
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(A.ll) 



<c 



min^K 1 + hi) k n^ A+ ^ k Y,{nh e ) 1 

1=2 



which goes to zero as k is chosen large enoug h. Hence, W^(h) = o p {n~^ 5 ), 
uniformly in h. Now, since E{D 2 ) = Etp' {-)MASE{h) /2 + Op(n~ 4: /^), (A.2) fol- 
lows from the above derivations. Finally, (A. 3) can be easily shown by the 
same techniques that are used to show (A.l) and (A.2). □ 

PROOF of Proposition 2. The proof follows the same route as that 
of Theorem 1 of [19]. □ 

Proof of Proposition 3. By standard technique, it can be shown 
that 

MASE(h) = (nh)- 1 [ K 2 {u)du E f^ 
J {Eip'{-)} 2 

(A.12) +^ j\m (2 \x)} 2 dx(^J u 2 K(u)d u y 

+ o{n- l h~ l + h A ) 

and argmin^ MASE(h) is given by /cn" 1 / 5 , where k is as in (11). Using a 
similar approach as in [23], 

D{e)= inf r n~^ 5 \RCRVD(u) - RCRVD(v)\. 

\u— v\>n~ 1 / 5 e 

It follows that 

P{\hcKDV ~ /*mase| > n~ 1/5 e) 

< p(n^ 5 sup \RCRVD(h M ASE) - RCRVD (h RC Rvv)\ > D{e 
\ h 

(A.13) 

RCRVD (/imase) - ^P-MASE(hMASE) 



<P[ n 4/5 sup 
h 



since 



and 



+ Ef^L MASE(h M ASE) - R CR VD ( /ircrvd ] 



MASE{h M ASE)=mmMASE(h) < MASE(h RC Rvn) 

h 



RCRVD (/iRCRvn) = min RCRVD(h) < RCRVD(h MASE )- 

h 



>D(e) 
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Therefore, (A. 13) can be written as 
P(\hRCRVD ~ ^mase| > n~ 1/5 e) 



<P n 4 / 5 



sup 

h 



RCRVD (/imase) - ^^-MASEihuASE] 



2 



+ P[ n 4 / 5 sup 

V h 

o, 



^^M^^(/i RC rvd) - RCRVD(h nC RVD] 



by Proposition 1. Similarly, -P(|/ircrvd-^ase| > n 1/5 e) -> and P(|/imase- 
frASE|>n- 1/5 e)-0. □ 
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